Abstract: Recently, deep learning approaches have achieved
promising results in various fields of computer vision. In this
paper, a new framework called Hierarchical Depth Motion Maps
(HDMM) + 3 Channel Deep Convolutional Neural Networks
(3ConvNets) is proposed for human action recognition using
depth map sequences. Firstly, we rotate the original depth data
as 3D point clouds to mimic the rotation of cameras, so that our
algorithms can handle view-variant cases. Secondly, in order to
effectively extract the body shape and motion information, we
generate weighted depth motion maps (DMM) at several temporal
scales, referred to as Hierarchical Depth Motion Maps (HDMM).
Then, three channels of ConvNets are trained separately on the
HDMMs from three projected orthogonal planes. The proposed
algorithms are evaluated on the MSRAction3D, MSRAction3DExt,
UTKinect-Action and MSRDailyActivity3D datasets respectively.
We also combine the last three datasets into a larger one (called
the Combined Dataset) and test the proposed method on it. The
results show that our approach can achieve state-of-the-art results
on the individual datasets without dramatic performance
degradation on the Combined Dataset.

I. Introduction

Human action recognition has been an active research topic 
in computer vision due to its wide range of applications, such 
as smart surveillance and human-computer interactions. In the 
past decades, research on action recognition mainly focused 
on recognising actions from conventional RGB videos. 

In earlier video-based action recognition, most researchers
aimed to design hand-crafted features and achieved significant
progress. However, in the evaluation conducted by Wang et al.
[1], one interesting finding is that there is no universally
best hand-engineered feature for all datasets.

Recently, the release of the Microsoft Kinect has brought new
opportunities to this field. The Kinect device can provide both
depth maps and RGB images in real time at low cost. Depth
maps have several advantages compared to traditional color
images. For example, depth maps reflect pure geometry and
shape cues, which can often be more discriminative than color
and texture in many problems including object segmentation
and detection. Moreover, depth maps are insensitive to changes
in lighting conditions. Based on depth data, many works [2],
[3], [4], [5] have been reported with specific feature
descriptors that take advantage of the properties of depth maps.
However, all of them are based on hand-crafted features, which
are shallow high-dimensional descriptions of local or global
spatio-temporal information, and their performance varies from
dataset to dataset.


Deep Convolutional Neural Networks (ConvNets) have
been demonstrated as an effective class of models for
understanding image content, offering state-of-the-art results on
image recognition, segmentation, detection and retrieval
Q, m, 0. With the success of ImageNet classification
with ConvNets [10], many works take advantage of trained
ImageNet models and achieve very promising performance
on several tasks, from attribute classification [11] to image
representations [12] to semantic segmentation [13]. The key
enabling factors behind these successes are techniques for
scaling up the networks to millions of parameters and massive
labelled datasets that can support the learning process. In this
work, we propose to apply ConvNets to depth map sequences
for action recognition. An architecture of Hierarchical Depth
Motion Maps (HDMM) + 3 Channel Convolutional Neural
Networks (3ConvNets) is proposed. HDMM is a technique
that can transform the problem of action recognition to image
classification and artificially enlarge the training data.
Specifically, to make our algorithms more robust to viewpoint 
variations, we directly process the 3D pointclouds and rotate 
the depth data into different views. To make full use of the 
additional body shape and motion information from depth 
sequences, each rotated depth frame is first projected onto three 
orthogonal Cartesian planes, and then for each projection view, 
the absolute differences (motion energy) between consecutive 
and sub-sampled frames are accumulated through an entire 
depth video sequence. To weight the importance of different
motion energy, a weighting factor is used so that the motion
energy of recent poses counts more than that of past ones.
Three HDMMs are constructed after the above steps and three
ConvNets are trained on the HDMMs. The final classification
scores are obtained by late fusion of the three ConvNets.

We evaluate our method on the MSRAction3D, MSRAction3DExt,
UTKinect-Action and MSRDailyActivity3D datasets individually
and achieve results that are better than or comparable to the
state of the art. To further verify the robustness of our
method, we combine the last three datasets into a single one
and test the proposed method on it. The results show that our
approach achieves consistent performance on the combined
dataset without much degradation.

The main contributions of this paper can be summarized
as follows. First, we propose a new architecture, namely
HDMM + 3ConvNets, for depth-based action recognition,
which achieves state-of-the-art results on four datasets.
Secondly, our method can handle view-variant cases for action
recognition to some extent, owing to the simple and direct
processing of 3D point clouds. Lastly, a large dataset is generated
by combining the existing ones to evaluate the stability of the
proposed method, because the combined dataset contains large
variations within actions and in background, viewpoint and number
of samples of each action across the three datasets.

The remainder of this paper is organized as follows. Section
II reviews the related work on deep learning for 2D video action
recognition and on action recognition using depth sequences.
Section III describes the proposed architecture. In Section IV,
various experimental results and analysis are presented. Section V
concludes the paper and discusses future work.


II. Related Work 

With the recent resurgence of neural networks invoked by
Hinton and others [14], deep neural architectures have been
used as an effective solution for extracting high-level features
from data. There have been a number of attempts to apply deep
architectures to 2D video recognition. In [15], spatio-temporal
features are learned unsupervised by a Convolutional Restricted
Boltzmann Machine (CRBM) and then plugged into a ConvNet
for action recognition. In [16], a 3D convolutional network is
used to automatically learn spatio-temporal features directly
from raw data. Recently, several ConvNet architectures for
action recognition were compared in [17] on the Sports-1M
dataset, comprising 1.1 M YouTube videos of sports activities.
The authors find that a network operating on individual video
frames performs similarly to networks whose input is a stack
of frames, which indicates that the learned spatio-temporal
features do not capture the motion effectively. In [18], spatial
and temporal streams are proposed for action recognition. Two
ConvNets are trained on the two streams and combined by late
fusion. The spatial stream consists of individual frames while
the temporal stream consists of stacked optical flow. However,
the best results of all the above deep learning methods can
only match the state-of-the-art results achieved by
hand-crafted features.

For depth-based action recognition, many works have been
reported in the past few years. Li et al. [2] sample points
from the silhouette of a depth image to obtain a bag of 3D points,
which are clustered to enable recognition. Yang et al. [19] stack
differences between projected depth maps as a DMM and then
use HOG to extract features from the DMM. This method
transforms the problem of action recognition from 3D space to
2D space. In [4], HON4D is proposed, in which the surface normal
is extended to 4D space and quantized by regular polychorons.
Following this method, Yang and Tian [5] cluster hypersurface
normals and form the polynormal, which can be used to jointly
capture the local motion and geometry information. A Super
Normal Vector (SNV) is generated by aggregating the low-level
polynormals. However, all of these methods are based
on carefully hand-designed features, which are restricted to
specific datasets and applications.

Our work is inspired by [19] and [10], where we transform
the problem of 3D action recognition to 2D image classification
in order to take advantage of trained ImageNet models Dl.


III. HDMM + 3ConvNets

A depth map can be used to capture the 3D structure
and shape information. Projecting the difference between
depth maps (DMM) onto three orthogonal Cartesian planes
can further characterize the motion information of an action
[19]. To make our method more robust to viewpoint variations,
we directly process the 3D point clouds and rotate the depth
data into different views. In order to exploit speed invariance
and weight the importance of motion energy along the time axis,
a sub-sampled and weighted HDMM is generated from the rotated
projected maps. Three deep ConvNets are trained on the three
projected planes of the HDMM. Late fusion is performed by
combining the softmax class posteriors from the three nets. The
overall framework is illustrated in Figure 1. Our algorithms can
be divided into three modules: Rotation in 3D Pointclouds,
Hierarchical DMM, and Network Training & Class Score
Fusion.

A. Rotation 

One of the challenges for action recognition is view
invariance. To handle this problem, we rotate the depth data as
3D point clouds, imitating the rotation of cameras around the
subject as illustrated in Figure 2 (b), where the rotation is in
the world coordinate system (Figure 2 (a)).



Fig. 2. Rotation in 3D Pointclouds: (a) world coordinate system; (b) model of camera rotation around the subject.

Figure 2 (b) is the model for the rotation of the camera around
the subject. Suppose the camera moves from position $P_o$ to
$P_d$; the move can be decomposed into two steps: first from
$P_o$ to $P_t$, with rotation angle $\theta$, and then from $P_t$
to $P_d$, with rotation angle $\beta$. The coordinates of the
subject in the rotated scene can then be computed by Equation (1).

$$R = R_y R_z [X, Y, Z, 1]^T \qquad (1)$$

where $R_y$ denotes the rotation around the y axis (right-handed
coordinate system) while $R_z$ denotes the rotation around the z
axis.
After rotation, the 3D point clouds are projected to the
screen coordinates as illustrated in Figure 2 (a). In this way,
the original depth data can be rotated to different angles,
provided that the rotation does not cause too much information
loss.

Fig. 1. HDMM + 3ConvNets architecture for depth-based action recognition.
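
The rotation of Equation (1) can be sketched as follows. This is only a minimal illustration, assuming standard right-handed homogeneous rotation matrices about the y and z axes and a point cloud that is already centred on the subject; the function names (`rot_y`, `rot_z`, `rotate_cloud`) are hypothetical, and the exact matrices used in this work may differ.

```python
import numpy as np

def rot_y(theta_deg):
    """4x4 homogeneous rotation about the y axis (assumed standard form)."""
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [-s, 0.0, c, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def rot_z(beta_deg):
    """4x4 homogeneous rotation about the z axis (assumed standard form)."""
    b = np.deg2rad(beta_deg)
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, -s, 0.0, 0.0],
                     [s, c, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def rotate_cloud(points_xyz, theta_deg, beta_deg):
    """Apply R_y R_z to an (N, 3) array of points and return an (N, 3) array."""
    pts = np.asarray(points_xyz, dtype=float)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # rows are [X, Y, Z, 1]
    rotated = (rot_y(theta_deg) @ rot_z(beta_deg) @ homo.T).T
    return rotated[:, :3]
```

Calling `rotate_cloud(points, 15, -5)` would give one rotated copy of the cloud; a zero rotation returns the input unchanged.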

B. HDMM 

In our work, each rotated 3D depth frame is projected onto
three orthogonal Cartesian planes, including front, side and top
views, denoted by $map_p$ where $p \in \{f, s, t\}$. Different from
[19], where the motion maps are calculated by accumulating
the thresholded difference between consecutive frames, we
process the depth maps with three additional steps. Firstly,
in order to preserve subtle motion information, for example,
page turning when reading books, for each projected map,
the motion energy is calculated as the absolute difference
between rotated consecutive or sub-sampled frames without
thresholding. Secondly, to effectively exploit speed invariance
and suppress noise, several temporal scales are generated, as
illustrated in Figure 3. For a depth video sequence with $N$
frames, $HDMM_p$ is obtained by stacking the motion energy
across the entire depth video sequence as follows:

$$HDMM_p = \sum_{t=a}^{b} \left| map_p^{(t-1)n+1} - map_p^{(t-2)n+1} \right| \qquad (2)$$

where $map_p^i$ denotes the $i$-th frame under projection view
$p$ of the whole depth video sequence and $i = (t-1)n+1$;
$t$ represents the index of the sub-sampled frame at the
corresponding temporal scale $n$ ($n \geq 1$); $a \in \{2, 3, ..., N\}$
and $b$ is the largest integer satisfying $(b-1)n+1 \leq N$.

Fig. 3. Illustration of hierarchical temporal scales.

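Lastly, to weight the different importance of motion energy, a
weighted HDMM is adopted as in Equation (3), making the
motion energy of currently performed actions more important
than that of past ones.

$$HDMM_p^{t} = \gamma \left| map_p^{(t-1)n+1} - map_p^{(t-2)n+1} \right| + \delta\, HDMM_p^{t-1} \qquad (3)$$

where the superscript $t$ on $HDMM_p$ indexes the sub-sampled
frames as in Equation (2).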

Through this simple process, pair actions, such as sit down 
and stand up, can be differentiated. 

After the above three steps, the rotated HDMMs are encoded
into RGB images, with small values encoded to the R
channel and large values to the B channel.
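
The accumulation in Equations (2) and (3) can be sketched as below. This is a minimal illustration under stated assumptions: the summation starts at $a = 2$, sub-sampling simply takes every $n$-th frame, and the function name `weighted_hdmm` is hypothetical. With `gamma = delta = 1` the recursion reduces to the plain sum of Equation (2).

```python
import numpy as np

def weighted_hdmm(frames, n=1, gamma=0.99, delta=1.0):
    """Accumulate weighted motion energy over frames sub-sampled with step n.

    frames: array of shape (N, H, W), one projected map per depth frame.
    Returns an (H, W) map.
    """
    frames = np.asarray(frames, dtype=float)
    sub = frames[::n]                 # frames 1, n+1, 2n+1, ... (1-based indices)
    hdmm = np.zeros_like(sub[0])
    for prev, cur in zip(sub[:-1], sub[1:]):
        energy = np.abs(cur - prev)   # absolute difference, no thresholding
        hdmm = gamma * energy + delta * hdmm
    return hdmm
```

One map per temporal scale could then be produced with, for example, `[weighted_hdmm(frames, n) for n in (1, 2, 3)]`; the particular set of scales here is only an example.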

C. Network Training & Class Score Fusion 

After we construct the RGB images from the depth motion
maps, three ConvNets are trained on the images of the three
projected planes. The layer configuration of our three ConvNets
is schematically shown in Figure 1, following [10]: each
net contains eight layers with weights, the first five being
convolutional layers and the remaining three fully-connected
layers. Our implementation is derived from the publicly available
Caffe toolbox [20] and runs on one NVIDIA GeForce GTX680M
card.

Training: The training procedure is similar to [10]: the
network weights are learnt using mini-batch stochastic
gradient descent with momentum set to 0.9 and weight decay
set to 0.0005; all hidden weight layers use the rectification
(ReLU) activation function; at each iteration, a mini-batch of
256 samples is constructed by sampling 256 shuffled training
images; all the images are resized to 256 x 256; to artificially
enlarge the training data (data augmentation), 224 x 224
patches are first randomly cropped from the center of the
selected image with a data augmentation factor of 2048,
and each patch then undergoes random horizontal flipping and RGB
jittering; the learning rate is initially set to 10 “^ for directly
training the networks from data without initialising the weights
with pre-trained models on ILSVRC-2012, while it is set to
10“^ for fine-tuning with pre-trained models on ILSVRC-2012,
and then it is decreased according to a fixed schedule,
which is kept the same for all training sets; for each ConvNet
we train 100 cycles and decrease the learning rate every
20 cycles. For all experimental settings, we set the dropout
regularisation ratio to 0.5 to reduce complex co-adaptations of
neurons in the nets.
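
The cropping and flipping part of this augmentation can be sketched as follows. This is a minimal illustration, not the Caffe pipeline used in this work: it assumes the crop offset is sampled uniformly over all valid positions of the resized image, it omits RGB jittering, and the function name `random_crop_flip` is hypothetical.

```python
import numpy as np

def random_crop_flip(image, crop=224, rng=None):
    """Return a random crop x crop patch of an (H, W, 3) image,
    mirrored horizontally with probability 0.5."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop + 1))    # valid row offsets
    left = int(rng.integers(0, w - crop + 1))   # valid column offsets
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                  # horizontal flip
    return patch
```

For a 256 x 256 input, the quoted factor of 2048 is consistent with roughly 32 x 32 crop offsets times 2 flips.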

Class Score Fusion: At test time, given a depth video
sequence (sample), we only use depth motion maps with
temporal scaling but without rotation for testing. The scores
of the n scales are averaged to give the final score of the
test sample in one channel of the 3ConvNets. The final class
scores for a test sample are the averages of the
outputs of the three ConvNets.
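
The two averaging steps of the late fusion can be sketched as below. This is a minimal illustration, assuming the softmax posteriors are stored in an array of shape (channels, scales, classes); the function name `fuse_scores` is hypothetical.

```python
import numpy as np

def fuse_scores(scores):
    """Late fusion of softmax posteriors.

    scores: array of shape (channels, scales, classes).
    Scores are averaged over scales within each channel, then over channels.
    Returns the fused class scores and the predicted class index.
    """
    scores = np.asarray(scores, dtype=float)
    per_channel = scores.mean(axis=1)   # (channels, classes)
    final = per_channel.mean(axis=0)    # (classes,)
    return final, int(np.argmax(final))
```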

IV. Experiments 

In this section, we extensively evaluate our proposed framework
on three public benchmark datasets: MSRAction3D [2],
UTKinect-Action [21] and MSRDailyActivity3D [3]. Moreover,
an extension of MSRAction3D, called the MSRAction3DExt
Dataset, is used, which contains more subjects performing
the same actions. In order to test the stability of the proposed
method, a new dataset is combined from the last three
datasets, referred to as the Combined Dataset. In all experiments,
for rotation, $\theta$ is set to (−30° : 15° : 30°) and $\beta$ is
set to (−5° : 5° : 5°); for the weighted HDMM, $\gamma$ is set to
0.99 and $\delta$ is set to 1. Different temporal scales are set
according to the noise level and mean cycle of the actions
performed in different datasets. Experimental results show that
our method can outperform or match the state of the art on
individual datasets and maintain the accuracy on the Combined
Dataset.
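
As a small worked example of the rotation settings above, the snippet below enumerates the angle pairs. It assumes that the (start : step : end) notation is inclusive at both ends and that every combination of the two angles is used; under these assumptions there are 5 x 3 = 15 views per sample, including the unrotated one.

```python
from itertools import product

thetas = list(range(-30, 31, 15))    # -30, -15, 0, 15, 30 degrees
betas = list(range(-5, 6, 5))        # -5, 0, 5 degrees
views = list(product(thetas, betas)) # all (theta, beta) pairs
print(len(views))
```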

A. MSRAction3D Dataset

The MSRAction3D Dataset [2] is an action dataset of depth
sequences captured by a depth camera. It contains 20 actions
performed by 10 subjects facing the camera, with each subject
performing each action 2 or 3 times. The 20 actions are: “high
arm wave”, “horizontal arm wave”, “hammer”, “hand catch”,
“forward punch”, “high throw”, “draw X”, “draw tick”, “draw
circle”, “hand clap”, “two hand wave”, “side-boxing”, “bend”,
“forward kick”, “side kick”, “jogging”, “tennis swing”, “tennis
serve”, “golf swing”, “pick up & throw”.

In order to facilitate a fair comparison, the same experimental
setting as in [3] is followed, namely, the cross-subject
setting: subjects 1, 3, 5, 7, 9 for training and subjects 2,
4, 6, 8, 10 for testing. For this dataset, we set temporal
scale n = 1, and our method achieves 100% accuracy. Four
scenarios are considered: (1) training on the primitive data set
(without rotation and temporal scaling), (2) training on the data
set after rotation, (3) pre-training on ILSVRC-2012 (pre-trained
for short) followed by fine-tuning on the data set after rotation,
(4) pre-trained followed by fine-tuning on the primitive data set.
The results for these settings are listed in Table 1.


TABLE I. Comparison on Different Training Settings for
MSRAction3D Dataset.

Training Setting                         Accuracy (%)
Primitive                                7.12
Rotation                                 34.23
Rotation + Pre-trained + Fine-tuning     100
Primitive + Pre-trained + Fine-tuning    100


From this table, we can see that pre-training on ILSVRC-2012
(initialising the networks with the weights trained for
ImageNet) is important, because the volume of training data
is too small to train the millions of parameters of the deep
networks without good initialisation, which leads to
overfitting. If we train the networks directly on the primitive
data without pre-training, the performance is only slightly
better than random guessing. We compare the performance of
HDMM + 3ConvNets with other results in Table 2.

TABLE II. Recognition accuracy comparison of our method
and previous approaches on MSRAction3D Dataset.

Method                       Accuracy (%)
Bag of 3D Points [2]         74.70
HOJ3D [21]                   79.00
Actionlet Ensemble           82.22
Depth Motion Maps [19]       88.73
HON4D                        88.89
Moving Pose [22]             91.70
SNV                          93.09
Proposed Method              100


The proposed method outperforms all previous approaches.
This is probably because (1) in MSRAction3D we
can easily segment the subject from the background just by
thresholding the depth values, so that the generated HDMM
contains little noise; (2) pre-trained models can initialise the
image-based deep networks well.

B. MSRAction3DExt Dataset

The MSRAction3DExt Dataset is an extension of the MSRAction3D
Dataset. It is captured with the same settings, with an
additional 13 subjects performing the same 20 actions 2 to 4
times. Thus, there are 20 actions, 23 subjects and 1379 video
clips. Similar to MSRAction3D, we also test our method on the
same four scenarios and the results are listed in Table 3. For
this dataset, we still adopt the cross-subject setting for training
and testing, that is, odd subjects for training and even subjects
for testing.

TABLE III. Comparison on Different Training Settings for
MSRAction3DExt Dataset.

Training Setting                         Accuracy (%)
Primitive                                10.00
Rotation                                 53.05
Rotation + Pre-trained + Fine-tuning     100
Primitive + Pre-trained + Fine-tuning    100


From Table 1 and Table 3 we can see that as the volume
of the dataset increases, training the nets from scratch on the
primitive and rotated data performs much better. However,
the performance of the trained models is still very poor if
a model pre-trained on ImageNet is not used for initialization.
Our method again achieves 100% using pre-trained + fine-tuning
even though this dataset has more test samples and
variations of actions. From the two sets of experiments, we
can conclude that pre-trained + fine-tuning is
very suitable for small datasets. In the following experiments,
we no longer train our networks from scratch and all
of the experiments adopt the pre-trained + fine-tuning setting.

We compare the performance of our method with SNV [5]
in Table 4; our method outperforms the state-of-the-art
result by a large margin.

TABLE IV. Recognition accuracy comparison of our method
and SNV on MSRAction3DExt Dataset.

Method               Accuracy (%)
SNV                  90.54
Proposed Method      100


C. UTKinect-Action Dataset 

The UTKinect-Action Dataset [21] is captured using a
stationary Kinect sensor. It consists of 10 actions: “walk”, “sit
down”, “stand up”, “pick up”, “carry”, “throw”, “push”, “pull”,
“wave” and “clap hands”. There are 10 different subjects
and each subject performs each action twice. This dataset is
designed to investigate variations in the viewpoint.

For this dataset, we set temporal scale n = 5, to exploit
more temporal information in actions. The cross-subject
scheme is followed as in 1^, which is different from [21]
where more subjects were used for training in each round. We
consider three scenarios for this dataset: (1) pre-trained +
fine-tuning on the primitive data set; (2) pre-trained +
fine-tuning on the data set after rotation; (3) pre-trained +
fine-tuning on the data set after rotation and temporal scaling.
The results are listed in Table 5.

TABLE V. Comparison on Different Training Settings for
UTKinect-Action Dataset.

Training Setting                                  Accuracy (%)
Primitive + Pre-trained + Fine-tuning             82.83
Rotation + Pre-trained + Fine-tuning              88.89
Rotation + Scaling + Pre-trained + Fine-tuning    90.91


From Table 5 we can see that rotation yields an improvement
of about 6% in accuracy (from 82.83% to 88.89%), which shows
that the rotation step in our method improves the accuracy
considerably. The confusion matrix for the final test is shown
in Figure 4; the most confused actions are
hand clap and wave, which share similar shapes of depth
motion maps.



Fig. 4. The confusion matrix of proposed method for UTKinect-Action 
Dataset. 

Table 6 shows the performance of our method compared
to the previous approaches on the UTKinect-Action Dataset.
The proposed method outperforms the methods specially
designed for view-variant cases.

TABLE VI. Recognition accuracy comparison of our method
and previous approaches on UTKinect-Action Dataset.

Method                   Accuracy (%)
DSTIP+DCSF [23]          85.8
Random Forests [24]      87.90
SNV                      88.89
Proposed Method          90.91


D. MSRDailyActivity3D Dataset

The MSRDailyActivity3D Dataset [3] is a daily activity
dataset of depth sequences captured by a depth camera. There
are 16 activities: “drink”, “eat”, “read book”, “call cellphone”,
“write on paper”, “use laptop”, “use vacuum cleaner”, “cheer
up”, “sit still”, “toss paper”, “play game”, “lay down on sofa”,
“walking”, “play guitar”, “stand up” and “sit down”. There are
10 subjects and each subject performs each activity twice, once
in a standing position and once in a sitting position. Compared
to the MSRAction3D(Ext) and UTKinect-Action datasets, actors
in this dataset present large spatial and scaling changes.
Moreover, most activities in this dataset involve human-object
interactions.

For this dataset, we set temporal scale n = 21, a larger
number of scales, to exploit more temporal information and
suppress the high-level noise in this dataset. We follow the
same experimental setting as El and obtain a final accuracy
of 81.88%. Three scenarios are considered for this dataset: (1)
pre-trained + fine-tuning on the primitive data set; (2) pre-trained
+ fine-tuning on the data set after temporal scaling; (3) pre-trained
+ fine-tuning on the data set after temporal scaling and rotation.
The results are listed in Table 7.

TABLE VII. Comparison on Different Training Settings for
MSRDailyActivity3D Dataset.

Training Setting                                  Accuracy (%)
Primitive + Pre-trained + Fine-tuning             46.25
Scaling + Pre-trained + Fine-tuning               75.62
Scaling + Rotation + Pre-trained + Fine-tuning    81.88


The performance of our method compared to the previous 
approaches is shown in Table 8 and the confusion matrix is 
shown in Figure 5. 

TABLE VIII. Recognition accuracy comparison of our
method and previous approaches on MSRDailyActivity3D
Dataset.

Method                       Accuracy (%)
LOP [3]                      42.50
Depth Motion Maps [19]       43.13
Joint Position               68.00
Moving Pose [22]             73.8
Local HON4D [4]              80.00
Actionlet Ensemble           85.75
SNV                          86.25
Proposed Method              81.88


From Table 8 we can see that our proposed method outperforms
DMM [19] greatly but is only comparable to the state-of-the-art
methods. The probable reasons are: (1) the background
of this dataset is more complicated than that of MSRAction3D
(Ext), yet we only pre-process by thresholding the depth value
as for MSRAction3D (Ext), which causes a lot of noise in the
HDMM; (2) many actions are similar, such as
call cellphone, drink and eat: they have similar motion shapes
and differ only in subtle motion energies, so their HDMMs are
very similar and confusing.

Fig. 5. The confusion matrix of proposed method for MSRDailyActivity
Dataset.

E. Combined Dataset 

The Combined Dataset consists of the MSRAction3DExt,
UTKinect-Action and MSRDailyActivity3D datasets and is used
to test the scalability of the proposed method.
The Combined Dataset is very challenging due to its large
variations in backgrounds, within actions, viewpoints and
number of samples in each action. The same actions in
different datasets are combined into one action, and all samples
are rewritten into the same file format (.bin file format) and
filename format (axxx_sxxx_exxx_depth.bin, where a, s and e
represent class ID, subject ID and example ID respectively, and
xxx represents the corresponding number) in the Combined
Dataset. The combined actions and corresponding information
are listed in Table 9.

TABLE IX. Information on Combined Dataset: A stands for
MSRAction3DExt Dataset; U stands for UTKinect-Action
Dataset; D stands for MSRDailyActivity3D Dataset; the
corresponding action names are shown in Figure 6 (Y label of
the fifth sub-figure).

Action   Subject                 Original    Action   Subject      Original
Label    Label                   Datasets    Label    Label        Datasets
1        s001-s023               A           21       s001-s020    U&D
2        s001-s023               A           22       s001-s020    U&D
3        s001-s023               A           23       s001-s020    U&D
4        s001-s023               A           24       s001-s010    U
5        s001-s023               A           25       s001-s010    U
6        s001-s023, s025-s034    A&U         26       s001-s010    U
7        s001-s023               A           27       s001-s010    U
8        s001-s023               A           28       s001-s010    D
9        s001-s023               A           29       s001-s010    D
10       s001-s023, s025-s034    A&U         30       s001-s010    D
11       s001-s023, s025-s034    A&U         31       s001-s010    D
12       s001-s023               A           32       s001-s010    D
13       s001-s023               A           33       s001-s010    D
14       s001-s023               A           34       s001-s010    D
15       s001-s023               A           35       s001-s010    D
16       s001-s023               A           36       s001-s010    D
17       s001-s023               A           37       s001-s010    D
18       s001-s023               A           38       s001-s010    D
19       s001-s023               A           39       s001-s010    D
20       s001-s023               A           40       s001-s010    D


For this dataset, we still use the cross-subject scheme: odd
subject IDs for training and even subject IDs for testing,
guaranteeing that the training and testing subjects of the
original datasets keep the same roles in the Combined Dataset.
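
The filename convention and the odd/even subject split can be sketched as follows. This is a minimal illustration; the helper names `parse_name` and `split_by_subject` are hypothetical, and only the axxx_sxxx_exxx_depth.bin pattern itself comes from the description above.

```python
import re

# Matches names such as a006_s025_e001_depth.bin
_NAME = re.compile(r"a(\d+)_s(\d+)_e(\d+)_depth\.bin$")

def parse_name(filename):
    """Return (class ID, subject ID, example ID) parsed from a sample filename."""
    match = _NAME.search(filename)
    if match is None:
        raise ValueError("unexpected filename: " + filename)
    action, subject, example = (int(g) for g in match.groups())
    return action, subject, example

def split_by_subject(filenames):
    """Cross-subject split: odd subject IDs for training, even for testing."""
    train, test = [], []
    for name in filenames:
        _, subject, _ = parse_name(name)
        (train if subject % 2 == 1 else test).append(name)
    return train, test
```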

To compromise between different datasets, we set temporal
scale n = 5 for this dataset. Our algorithms are tested with
two settings: (1) pre-trained + fine-tuning on the primitive data
set; (2) pre-trained + fine-tuning on the data set after temporal
scaling and rotation. The corresponding results are listed in
Table 10.

TABLE X. Comparison on Two Training Settings for the
Overall Results on Combined Dataset.

Training Setting                                  Accuracy (%)
Primitive + Pre-trained + Fine-tuning             87.20
Rotation + Scaling + Pre-trained + Fine-tuning    90.92


From Table 10 we can see that for this big dataset, rotation
and scaling are less effective than on the smaller
datasets. One probable reason is that the training of ConvNets
can benefit from the large primitive data set to fine-tune millions
of parameters.

We compare our method with SNV [5] on this dataset: we
first train one model on the Combined Dataset and then test
the model on the original datasets and the Combined Dataset. The
results and corresponding confusion matrices are shown in Table
11 and Figure 6.

TABLE XI. Recognition accuracy comparison of SNV and
our method on Combined Dataset and its original datasets.

Dataset                 SNV (%)    Proposed Method (%)
MSRAction3D             89.83      94.58
MSRAction3DExt          91.15      94.05
UTKinect-Action         93.94      91.92
MSRDailyActivity3D      60.63      78.12
Combined                86.11      90.92


From Table 11 we can see that the proposed method can
maintain its accuracy without dramatic variation as the
variation and complexity of the datasets increase; at the
same time, it outperforms the SNV method by a large margin
on the Combined Dataset. The compromise between
different datasets in temporal scales leads to a slight decrease
in accuracy on some individual datasets. For example, for
MSRDailyActivity3D, decreasing the temporal scale from n = 21
to n = 5 leads to a drop in accuracy, due to the loss
of shape and speed information in this dataset, which has
high-level noise and a complicated background. For
MSRAction3D (Ext), however, the accuracy drops because some
of the actions are similar after temporal scaling, such as hand
catch and high throw, due to the simplicity of these datasets.

F. Analysis 

In this section, we give an overall analysis of the proposed
method on data augmentation and parameter selection according
to the extensive experimental results.

In our method, ConvNets are adopted for feature extraction
and classification. Generally, ConvNets need a large volume of
data to tune millions of parameters and reduce overfitting.
Directly training the ConvNets with a small amount of data
leads to very poor performance due to overfitting, which can
be seen from Table 1 and Table 3. Due to the small volume
of the available datasets (even the Combined Dataset), artificial
data augmentation is needed. In our method, two strategies
are used for this purpose: rotation and temporal scaling, one
for viewpoint invariance and one for speed invariance. However,
without initialising the ConvNets with a model pre-trained on
ImageNet, the artificially enlarged data set is still not enough
to train the whole nets from random initialization of the millions
of weights, because much of the data is interdependent and less
informative. The pre-trained + fine-tuning approach provides a
promising direction for small datasets in our method. The
pre-trained model places the nets in a favourable region of the
loss function, and fine-tuning lets the nets reach a good
solution even on small datasets.

Fig. 6. The confusion matrix of proposed method for original
MSRAction3D, MSRAction3DExt, UTKinect-Action, MSRDailyActivity and
Combined datasets (from top left to bottom right).

For different datasets, different numbers of temporal scales 
are set to obtain the best results, for the following reasons. 
For simple action datasets (or gesture datasets), such as 
MSRAction3D (Ext), one scale is enough to distinguish between 
actions (gestures), because the motion cycle is short. 
For activity datasets, such as UTKinect-Action and 
MSRDailyActivity3D, more scales are needed, because the motion 
cycles in these datasets are much longer and usually 
contain several primitive actions (gestures); a larger number 
of scales can capture the motion information at different 
temporal scales. For noisy datasets, such as MSRDailyActivity3D, 
more temporal scales should also be set in order to suppress 
the effects of the high level of noise in these datasets. 
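
To make the role of the number of temporal scales concrete, the following is 
a minimal sketch of motion maps computed at several scales. It rests on 
stated assumptions rather than reproducing our implementation: scale n is 
assumed to keep every n-th frame, the maps are unweighted accumulations of 
absolute frame differences, and only a single (front) projection is handled. 
The function names are hypothetical. 

```python
import numpy as np

def depth_motion_map(frames):
    """Sum of absolute differences between consecutive frames, shape (H, W).

    This is an unweighted simplification; any per-frame weighting is omitted.
    """
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)

def hierarchical_dmms(frames, num_scales):
    """Return one motion map per temporal scale.

    ASSUMPTION: scale n sub-samples the sequence by keeping every n-th frame,
    starting from the first frame.
    """
    frames = np.asarray(frames, dtype=np.float64)
    return [depth_motion_map(frames[::n]) for n in range(1, num_scales + 1)]

# Usage: a toy sequence of 7 frames whose pixels alternate 0, 1, 0, 1, ...
toy = np.stack([np.full((2, 2), t % 2) for t in range(7)])
maps = hierarchical_dmms(toy, num_scales=3)
```

In this toy sequence the fast oscillation is visible at scale 1 but vanishes 
at scale 2, which illustrates why a single scale can suffice for short, 
simple motions, while longer activities call for several scales. 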


V. Conclusion and Future Work 

In this paper, we propose a deep classification model for 
action recognition using depth map sequences. The proposed 
method is evaluated on several datasets and compared with a 
number of state-of-the-art approaches. It achieves 
state-of-the-art results on the individual datasets and maintains 
its accuracy on a more complicated dataset combined from 
available public datasets. Through rotation and temporal scaling, the 
volume of training data is artificially enlarged, from which 
the ConvNets benefit, obtaining better results than when trained on 
the original data alone. Pre-training followed by fine-tuning is adopted to 
train the ConvNets on small datasets, which achieves state-of-the-art 
results in most cases. However, due to the high level of noise 
and the complicated backgrounds of some datasets, our method is 
only competitive with previous methods on them. Moreover, our method 
does not consider the skeleton data, with which much success has 
been achieved for action recognition. With the development of deep 
learning methods, the combination of Deep Belief Networks 
for skeleton data (generative model) and deep Convolutional 
Neural Networks for depth data (discriminative model) will 
open a new door for 3D action recognition. In our future 
work, we will combine the proposed method with 
object segmentation and skeleton-based methods to improve the 
recognition accuracy. 


References 

[1] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, C. Schmid et al., 
“Evaluation of local spatio-temporal features for action recognition,” 
in BMVC, 2009. 

[2] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D 
points,” in CVPRW, 2010. 

[3] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for 
action recognition with depth cameras,” in CVPR, 2012. 

[4] O. Oreifej and Z. Liu, “Hon4d: Histogram of oriented 4d normals for 
activity recognition from depth sequences,” in CVPR, 2013. 

[5] X. Yang and Y. Tian, “Super normal vector for activity recognition 
using depth sequences,” in CVPR, 2014. 

[6] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical 
features for scene labeling,” PAMI, vol. 35, no. 8, pp. 1915-1929, 2013. 

[7] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, 
“Overfeat: Integrated recognition, localization and detection using 
convolutional networks,” in ICLR, 2014. 

[8] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN 
features off-the-shelf: An astounding baseline for recognition,” in CVPRW, 
2014. 

[9] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of 
multilayer neural networks for object recognition,” in ECCV, 2014. 

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification 
with deep convolutional neural networks,” in NIPS, 2012. 

[11] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “Panda: 
Pose aligned networks for deep attribute modeling,” in CVPR, 2014. 

[12] M. Oquab, L. Bottou, I. Laptev, J. Sivic et al., “Learning and 
transferring mid-level image representations using convolutional neural 
networks,” in CVPR, 2014. 

[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature 
hierarchies for accurate object detection and semantic segmentation,” 
in CVPR, 2014. 

[14] G. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for 
deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527-1554, 
2006. 

[15] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, “Convolutional 
learning of spatio-temporal features,” in ECCV, 2010. 




[16] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks 
for human action recognition,” PAMI, vol. 35, no. 1, pp. 221-231, 2013. 

[17] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and 
L. Fei-Fei, “Large-scale video classification with convolutional neural 
networks,” in CVPR, 2014. 

[18] K. Simonyan and A. Zisserman, “Two-stream convolutional networks 
for action recognition in videos,” in NIPS, 2014. 

[19] X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth 
motion maps-based histograms of oriented gradients,” in ACM MM, 
2012. 

[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, 
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for 
fast feature embedding,” arXiv:1408.5093, 2014. 

[21] L. Xia, C.-C. Chen, and J. Aggarwal, “View invariant human action 
recognition using histograms of 3d joints,” in CVPRW, 2012. 

[22] M. Zanfir, M. Leordeanu, and C. Sminchisescu, “The moving pose: An 
efficient 3d kinematics descriptor for low-latency action recognition and 
detection,” in ICCV, 2013. 

[23] L. Xia and J. Aggarwal, “Spatio-temporal depth cuboid similarity 
feature for activity recognition using depth camera,” in CVPR, 2013. 

[24] Y. Zhu, W. Chen, and G. Guo, “Fusing spatiotemporal features and 
joints for 3d action recognition,” in CVPRW, 2013. 



